Efficient Optimization of Performance Measures 
by Classifier Adaptation 



Nan Li 1,2, Ivor W. Tsang 3, Zhi-Hua Zhou 1

1 National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210046, China
2 School of Mathematical Sciences, Soochow University, Suzhou 215006, China
3 School of Computer Engineering, Nanyang Technological University, 639798, Singapore



Abstract 

In practical applications, machine learning algorithms are often needed to learn classifiers that
optimize domain-specific performance measures. Previous research has focused on learning
the needed classifier in isolation, yet learning a nonlinear classifier for nonlinear and nonsmooth
performance measures is still hard. In this paper, rather than learning the needed classifier
by optimizing a specific performance measure directly, we circumvent this problem by proposing
a novel two-step approach called CAPO: first train nonlinear auxiliary classifiers
with existing learning methods, and then adapt the auxiliary classifiers for specific performance
measures. In the first step, auxiliary classifiers can be obtained efficiently with off-the-shelf
learning algorithms. For the second step, we show that the classifier adaptation problem can be
reduced to a quadratic programming problem, which is similar to linear SVM^perf and can be solved
efficiently. By exploiting nonlinear auxiliary classifiers, CAPO can generate nonlinear classifiers that
optimize a large variety of performance measures, including all performance measures based on
the contingency table and AUC, whilst keeping high computational efficiency. Empirical studies
show that CAPO is effective and computationally efficient, and that it is even more efficient
than linear SVM^perf.

1. Introduction



In real-world applications, different user requirements often call for different domain-specific
performance measures to evaluate the success of learning algorithms. For example, F1-score and
Precision-Recall Breakeven Point (PRBEP) are usually employed in text classification; Precision
and Recall are often used in information retrieval; Area Under the ROC Curve (AUC) and
Mean Average Precision (MAP) are important to ranking. Ideally, to achieve good prediction
performance, learning algorithms should train classifiers by optimizing the concerned performance
measures. However, this is usually not easy due to the nonlinear and nonsmooth nature of many
performance measures like F1-score and PRBEP.

During the past decade, many algorithms have been developed to optimize frequently used
performance measures, and they have shown better performance than conventional methods
191. 1151. 151. 141]. So far, research has focused on training the needed classifier in isolation.
In general, however, it is still challenging to design general-purpose learning algorithms that train
nonlinear classifiers optimizing nonlinear and nonsmooth performance measures, though they are
much needed in practice. For example, SVM^perf proposed by Joachims [15] can efficiently optimize a
large variety of performance measures in the linear case, but its nonlinear kernelized extension
suffers from computational problems.

In this paper, rather than directly designing sophisticated algorithms to optimize specific
performance measures, we take a different strategy and present a novel two-step approach called CAPO
to cope with this problem. Specifically, we first train auxiliary classifiers by exploiting existing
off-the-shelf learning algorithms, and then adapt the obtained auxiliary classifiers to optimize
the concerned performance measure. Many algorithms proposed in the literature can train the
auxiliary classifiers quite efficiently, even on large-scale data, so the first step can be easily
performed. For the second step, to make use of the auxiliary classifiers, we consider the classifier
adaptation problem under the function-level adaptation framework [29], and formulate it as a
quadratic programming problem which is similar to linear SVM^perf [15] and can also be solved
efficiently. Hence, in total, CAPO can work efficiently.

A prominent advantage of CAPO is that it is a flexible framework, which can handle different
types of auxiliary classifiers and a large variety of performance measures, including all
performance measures based on the contingency table and AUC. By exploiting nonlinear auxiliary
classifiers, CAPO can train nonlinear classifiers optimizing the concerned performance measure
with low computational cost. This is very helpful, because nonlinear classifiers are preferred in
many real-world applications but training such a nonlinear classifier often has a high
computational cost (e.g., nonlinear kernelized SVM^perf). In empirical studies, we perform experiments on
data sets from different domains. We find that CAPO is more effective and more efficient than
state-of-the-art methods, scales well with respect to training data size, and is robust to
its parameters. It is worth mentioning that the classifier adaptation procedure of CAPO is even
more efficient than linear SVM^perf, though it employs the same cutting-plane algorithm to solve
the classifier adaptation problem.

The rest of this paper is organized as follows. Section 2 briefly describes some background,
including the problem studied here and SVM^perf. Section 3 presents our proposed CAPO approach.
Section 4 discusses related work. Section 5 reports on our empirical studies,
followed by the conclusion in Section 6.

2. Optimizing Performance Measures 

In this section, we first present the problem of optimizing performance measures, and then
introduce SVM^perf and its kernelized extension.




2.1. Preliminaries and Background 

In machine learning tasks, given a set of $n$ training examples $D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$, where
$\mathbf{x}_i \in \mathcal{X}$ and $y_i \in \{-1,+1\}$ are an input pattern and its class label, our goal is to learn a classifier
$f(\mathbf{x})$ that minimizes the expected risk on a new data sample $S = \{(\mathbf{x}'_1, y'_1), \ldots, (\mathbf{x}'_m, y'_m)\}$, i.e.,

$$R^{\Delta}(f) = \mathbb{E}_S\left[\Delta\big((y'_1, \ldots, y'_m), (f(\mathbf{x}'_1), \ldots, f(\mathbf{x}'_m))\big)\right],$$

where $\Delta((y'_1, \ldots, y'_m), (f(\mathbf{x}'_1), \ldots, f(\mathbf{x}'_m)))$ is the loss function which quantifies the loss of $f$ on
$S$. Subsequently, we use the notation $\Delta(f; S)$ to denote $\Delta((y'_1, \ldots, y'_m), (f(\mathbf{x}'_1), \ldots, f(\mathbf{x}'_m)))$ for
convenience. Since it is intractable to compute the expectation $\mathbb{E}_S[\cdot]$, discriminative learning
methods usually approximate the expected risk using the empirical risk

$$R^{\Delta}_{D}(f) = \Delta(f; D),$$

which measures $f(\mathbf{x})$'s loss on the training data $D$, and then train classifiers by minimizing
empirical risk or regularized risk. In practice, domain-specific performance measures are usually
employed to evaluate the success of learnt classifiers. Thus, good performance can be expected
if the classifiers are trained by directly optimizing the concerned performance measures. Here,
we are interested in taking practical performance measures (e.g., F1-score and PRBEP) as the
loss function $\Delta$, instead of surrogate functions (e.g., hinge loss and exponential
loss). In this situation, the loss function $\Delta$ can be a nonlinear and nonsmooth function of the training
examples in $D$, and thus it is computationally challenging to optimize the empirical risk $\Delta$ in practice.

In the literature, some methods have been developed to optimize frequently used performance
measures, such as AUC, F1-score, NDCG and MAP. Among existing methods that try to optimize
performance measures directly, SVM^perf proposed by Joachims [15] is a representative example.
One of its attractive advantages is that, by employing the multivariate prediction framework, it
can directly handle a large variety of performance measures, including AUC and all measures that
can be computed from the contingency table, while most other methods are specially designed for
one specific performance measure. Subsequently, we describe it and also show its limitation.
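To make the difference between decomposable and multivariate measures concrete, the following minimal sketch computes a few losses for labels in {-1, +1}. It rests on assumptions that are not stated in the paper: the loss is taken as one minus the measure, F1 is defined as 0 when its denominator is zero, ties in the AUC loss count as half an error, and all function names are hypothetical.

```python
from typing import Sequence, Tuple


def contingency(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for labels in {-1, +1}."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == -1 and p == 1:
            fp += 1
        elif t == 1 and p == -1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def error_rate_loss(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of misclassified examples (decomposes over examples)."""
    _, fp, fn, _ = contingency(y_true, y_pred)
    return (fp + fn) / len(y_true)


def f1_loss(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """1 - F1, with F1 = 2tp / (2tp + fp + fn); assumed 0 if the denominator is 0."""
    tp, fp, fn, _ = contingency(y_true, y_pred)
    denom = 2 * tp + fp + fn
    f1 = 2.0 * tp / denom if denom else 0.0
    return 1.0 - f1


def auc_loss(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked in the wrong order."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == -1]
    bad = sum(1.0 if p < n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return bad / (len(pos) * len(neg))


y = [1, 1, -1, -1]
# Flip the last prediction from -1 to +1 in two different contexts.
delta_err_a = error_rate_loss(y, [1, 1, 1, 1]) - error_rate_loss(y, [1, 1, 1, -1])
delta_err_b = error_rate_loss(y, [-1, -1, 1, 1]) - error_rate_loss(y, [-1, -1, 1, -1])
delta_f1_a = f1_loss(y, [1, 1, 1, 1]) - f1_loss(y, [1, 1, 1, -1])
delta_f1_b = f1_loss(y, [-1, -1, 1, 1]) - f1_loss(y, [-1, -1, 1, -1])
```

In this toy case the error rate changes by the same amount in both contexts, whereas the change in F1 loss depends on the other predictions, which is why F1 cannot be written as a sum of per-example losses.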



2.2. SVM^perf and Its Kernelized Extension

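Since many performance measures cannot be decomposed over individual predictions, SVM^perf
takes a multivariate prediction formulation and maps a tuple of $n$ patterns $\bar{x} =
(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ to a tuple of $n$ class labels $\bar{y} = (y_1, \ldots, y_n)$ by

$$\bar{f}: \mathcal{X}^n \mapsto \mathcal{Y}^n,$$

where $\mathcal{Y}^n \subseteq \{-1,+1\}^n$ is the set of all admissible label vectors. To implement this mapping, it
exploits a discriminant function and makes predictions as

$$\bar{f}(\bar{x}) = \arg\max_{\bar{y}' \in \mathcal{Y}^n} \mathbf{w}^T \Psi(\bar{x}, \bar{y}'), \tag{1}$$

where $\mathbf{w}$ is a parameter vector and $\Psi(\bar{x}, \bar{y}')$ is a feature vector relating $\bar{x}$ and $\bar{y}'$. Obviously, the
computational efficiency of the inference (1) highly depends on the form of the feature vector
$\Psi(\bar{x}, \bar{y}')$.

Algorithm 1 Cutting-plane algorithm for training linear SVM^perf [1
1: Input: $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, $C$, $\epsilon$
2: $\mathcal{W} \leftarrow \emptyset$
3: repeat
4:   $(\mathbf{w}, \xi) \leftarrow \arg\min_{\mathbf{w}, \xi \ge 0} \frac{1}{2}\|\mathbf{w}\|^2 + C\xi$
     s.t. $\forall \bar{y}' \in \mathcal{W}: \mathbf{w}^T[\Psi(\bar{x}, \bar{y}) - \Psi(\bar{x}, \bar{y}')] \ge \Delta(\bar{y}, \bar{y}') - \xi$
5:   find the most violated constraint by $\bar{y}' \leftarrow \arg\max_{\bar{y}'' \in \mathcal{Y}^n} \{\Delta(\bar{y}, \bar{y}'') + \mathbf{w}^T \Psi(\bar{x}, \bar{y}'')\}$
6:   $\mathcal{W} \leftarrow \mathcal{W} \cup \{\bar{y}'\}$
7: until $\Delta(\bar{y}, \bar{y}') - \mathbf{w}^T[\Psi(\bar{x}, \bar{y}) - \Psi(\bar{x}, \bar{y}')] \le \xi + \epsilon$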
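
2.2.1. Linear Case

In the linear case, the feature vector $\Psi(\bar{x}, \bar{y}')$ is restricted to be

$$\Psi(\bar{x}, \bar{y}') = \sum_{i=1}^{n} y'_i \mathbf{x}_i,$$

thus the argmax in (1) can be achieved by assigning $y'_i$ to $\mathrm{sign}(\mathbf{w}^T \mathbf{x}_i)$, leading to a linear classifier
$f(\mathbf{x}) = \mathrm{sign}[\mathbf{w}^T \mathbf{x}]$. To learn the parameter $\mathbf{w}$, the following optimization problem is formulated:

$$\begin{aligned}
\min_{\mathbf{w}, \xi \ge 0}\ & \frac{1}{2}\|\mathbf{w}\|^2 + C\xi \\
\text{s.t.}\ & \forall \bar{y}' \in \mathcal{Y}^n \setminus \bar{y}:\ \mathbf{w}^T[\Psi(\bar{x}, \bar{y}) - \Psi(\bar{x}, \bar{y}')] \ge \Delta(\bar{y}, \bar{y}') - \xi,
\end{aligned} \tag{2}$$

where $\Delta(\bar{y}, \bar{y}')$ is the loss of mapping $\bar{x}$ to $\bar{y}'$ while its true label vector is $\bar{y}$. It is not hard to
see that $\Delta(\bar{y}, \bar{y}')$ can incorporate many types of performance measures, and that problem (2)
optimizes an upper bound of the empirical risk [15].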

While there are a huge number of constraints in (2), the cutting-plane algorithm in Algorithm 1
can be used to solve it, and this algorithm has been shown to need at most $O(1/\epsilon)$ iterations to
converge to an $\epsilon$-accurate solution. In each iteration, it needs to find the most violated
constraint by solving

$$\arg\max_{\bar{y}' \in \mathcal{Y}^n} \{\Delta(\bar{y}, \bar{y}') + \mathbf{w}^T \Psi(\bar{x}, \bar{y}')\}. \tag{3}$$

It has been shown that if the discriminant function $\mathbf{w}^T \Psi(\bar{x}, \bar{y}')$ can be written in the form
$\sum_{i=1}^{n} y'_i f(\mathbf{x}_i)$, the inference (3) can be solved for many performance measures in polynomial
time, that is, $O(n^2)$ for contingency-table-based performance measures (such as F1-score) and
$O(n \log n)$ for AUC [15]. Hence, Algorithm 1 can train SVM^perf in polynomial time.
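The quadratic-time search for the most violated constraint can be sketched for the F1 loss as follows. This is a hedged illustration rather than the reference implementation: it assumes the loss is 1 - F1 (with F1 set to 0 when its denominator is zero), takes the per-example scores `scores[i]` (standing for $\mathbf{w}^T\mathbf{x}_i$) as given, and uses hypothetical function names. For a fixed number `a` of true positives and `d` of true negatives, the score term is maximized by labelling the `a` highest-scoring positives as +1 and the `d` lowest-scoring negatives as -1, so only the pairs `(a, d)` need to be enumerated. A brute-force search over all label vectors is included purely as a check on small inputs.

```python
from itertools import product
from typing import List, Sequence, Tuple


def f1_loss_from_counts(tp: int, fp: int, fn: int) -> float:
    """1 - F1 computed from contingency-table counts (F1 assumed 0 if undefined)."""
    denom = 2 * tp + fp + fn
    return 1.0 - (2.0 * tp / denom if denom else 0.0)


def most_violated_f1(y: Sequence[int], scores: Sequence[float]) -> Tuple[List[int], float]:
    """Maximize loss(y, y') + sum_i y'_i * scores[i] over all y' in {-1, +1}^n."""
    pos = sorted((i for i, t in enumerate(y) if t == 1), key=lambda i: -scores[i])
    neg = sorted((i for i, t in enumerate(y) if t == -1), key=lambda i: scores[i])
    # pos_gain[a]: score term of the positives when the a highest-scoring get +1.
    pos_gain = [-sum(scores[i] for i in pos)]
    for i in pos:
        pos_gain.append(pos_gain[-1] + 2.0 * scores[i])
    # neg_gain[d]: score term of the negatives when the d lowest-scoring get -1.
    neg_gain = [sum(scores[i] for i in neg)]
    for i in neg:
        neg_gain.append(neg_gain[-1] - 2.0 * scores[i])
    best_val, best_a, best_d = float("-inf"), 0, 0
    for a in range(len(pos) + 1):          # a = number of true positives
        for d in range(len(neg) + 1):      # d = number of true negatives
            loss = f1_loss_from_counts(a, len(neg) - d, len(pos) - a)
            val = loss + pos_gain[a] + neg_gain[d]
            if val > best_val:
                best_val, best_a, best_d = val, a, d
    y_prime = [0] * len(y)
    for rank, i in enumerate(pos):
        y_prime[i] = 1 if rank < best_a else -1
    for rank, i in enumerate(neg):
        y_prime[i] = -1 if rank < best_d else 1
    return y_prime, best_val


def objective(y: Sequence[int], cand: Sequence[int], scores: Sequence[float]) -> float:
    """Value of loss(y, cand) + sum_i cand_i * scores[i]."""
    tp = sum(1 for t, p in zip(y, cand) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y, cand) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(y, cand) if t == 1 and p == -1)
    return f1_loss_from_counts(tp, fp, fn) + sum(p * s for p, s in zip(cand, scores))


def brute_force_value(y: Sequence[int], scores: Sequence[float]) -> float:
    """Exhaustive check over all 2^n label vectors (small n only)."""
    return max(objective(y, cand, scores) for cand in product((-1, 1), repeat=len(y)))


y = [1, -1, 1, -1, -1, 1]
scores = [0.3, -0.2, -0.5, 0.4, 0.1, 0.05]
y_prime, value = most_violated_f1(y, scores)
```

The double loop visits at most $(n+1)^2$ count pairs, which matches the quadratic cost quoted above for contingency-table-based measures.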



2.2.2. Kernelized Extension 



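Using the kernel trick, the linear SVM^perf described above can be extended to the nonlinear case.
It is easy to obtain the dual of (2) as

$$\begin{aligned}
\max_{\boldsymbol{\alpha} \ge 0}\ & -\frac{1}{2}\boldsymbol{\alpha}^T H \boldsymbol{\alpha} + \sum_{\bar{y}' \in \mathcal{W}} \alpha_{\bar{y}'} \Delta(\bar{y}, \bar{y}') \\
\text{s.t.}\ & \boldsymbol{\alpha}^T \mathbf{1} = C,
\end{aligned} \tag{4}$$

where $\boldsymbol{\alpha}$ is the column vector of $\alpha_{\bar{y}'}$'s and $H$ is the Gram matrix with the entry $H(\bar{y}', \bar{y}'')$ as

$$H(\bar{y}', \bar{y}'') = [\Psi(\bar{x}, \bar{y}) - \Psi(\bar{x}, \bar{y}')]^T [\Psi(\bar{x}, \bar{y}) - \Psi(\bar{x}, \bar{y}'')].$$

By replacing the primal problem with its dual in Line 4, it is easy to get the dual variant of
Algorithm 1, which can solve the problem in at most $O(1/\epsilon)$ iterations. In the
solution, each $\alpha_{\bar{y}'}$ corresponds to a constraint in $\mathcal{W}$, and the discriminant function $\mathbf{w}^T \Psi(\bar{x}, \bar{y}')$ in
(1) can be written as

$$\mathbf{w}^T \Psi(\bar{x}, \bar{y}') = \sum_{\bar{y}'' \in \mathcal{W}} \alpha_{\bar{y}''} [\Psi(\bar{x}, \bar{y}) - \Psi(\bar{x}, \bar{y}'')]^T \Psi(\bar{x}, \bar{y}').$$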

Obviously, the inner product $\Psi(\bar{x}, \bar{y}')^T \Psi(\bar{x}, \bar{y}'')$ can be computed via a kernel $K(\bar{x}, \bar{y}', \bar{x}, \bar{y}'')$.
However, if so, the argmax in (1) and (3) will become computationally
intractable. Hence, feature vectors of the following form are used:

$$\Psi(\bar{x}, \bar{y}') = \sum_{i=1}^{n} y'_i \Phi(\mathbf{x}_i),$$

where $\Phi(\mathbf{x}_i)^T \Phi(\mathbf{x}_j)$ can be computed via a kernel function $K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i)^T \Phi(\mathbf{x}_j)$. Then, the
discriminant function becomes

$$\mathbf{w}^T \Psi(\bar{x}, \bar{y}') = \sum_{i=1}^{n} y'_i \sum_{j=1}^{n} \beta_j K(\mathbf{x}_i, \mathbf{x}_j), \tag{5}$$

where $\beta_j = \sum_{\bar{y}'' \in \mathcal{W}} \alpha_{\bar{y}''} (y_j - y''_j)$. In this case, the argmax in (1) can be achieved by assigning
each $y'_i$ with $\mathrm{sign}\big(\sum_{j=1}^{n} \beta_j K(\mathbf{x}_i, \mathbf{x}_j)\big)$, which produces the kernelized classifier

$$f(\mathbf{x}) = \mathrm{sign}\Big[\sum_{j=1}^{n} \beta_j K(\mathbf{x}, \mathbf{x}_j)\Big].$$

However, in each iteration, the Gram matrix $H$ needs to be updated by adding a new row/column
for the new constraint. Suppose $\bar{y}^+$ is added; for every $\bar{y}' \in \mathcal{W}$, it requires computing

$$H(\bar{y}^+, \bar{y}') = \sum_{i=1}^{n} \sum_{j=1}^{n} (y_i - y^+_i)(y_j - y'_j) K(\mathbf{x}_i, \mathbf{x}_j).$$

Thus, letting $m$ denote the number of constraints in $\mathcal{W}$ and $n$ the data size, it takes $O(mn^2)$
kernel evaluations in each iteration. Also, computing the discriminant
function (5) requires $O(n^2)$ kernel evaluations, which adds to the computational cost of the
inference (3). These issues make the kernelized extension of SVM^perf suffer from computational
problems, even on reasonably sized data sets. However, nonlinear classifiers are much
needed in many practical applications. Hence, training nonlinear classifiers that optimize a specific
performance measure is central to this work.
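The cost argument can be illustrated with a naive sketch that computes one new row of the Gram matrix $H$ and counts kernel evaluations. It assumes that the kernel matrix is not cached, uses a linear kernel so that the result can be checked against the explicit feature vectors, and all names and toy values are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def linear_kernel(a: Vector, b: Vector) -> float:
    """K(a, b) = a^T b, i.e. Phi is the identity map."""
    return sum(p * q for p, q in zip(a, b))


def gram_entry(y: Sequence[int], y_a: Sequence[int], y_b: Sequence[int],
               X: Sequence[Vector], kernel: Kernel) -> Tuple[float, int]:
    """Return H(y_a, y_b) = sum_i sum_j (y_i - y_a_i)(y_j - y_b_j) K(x_i, x_j)
    together with the number of kernel evaluations performed."""
    total, calls = 0.0, 0
    n = len(y)
    for i in range(n):
        for j in range(n):
            total += (y[i] - y_a[i]) * (y[j] - y_b[j]) * kernel(X[i], X[j])
            calls += 1
    return total, calls


def new_gram_row(y: Sequence[int], y_new: Sequence[int],
                 working_set: Sequence[Sequence[int]],
                 X: Sequence[Vector], kernel: Kernel) -> Tuple[List[float], int]:
    """Entries of H between a newly added constraint and every constraint in W."""
    row, calls = [], 0
    for y_old in working_set:
        h, c = gram_entry(y, y_new, y_old, X, kernel)
        row.append(h)
        calls += c
    return row, calls


def psi_diff(y: Sequence[int], y_other: Sequence[int], X: Sequence[Vector]) -> List[float]:
    """Explicit Psi(x, y) - Psi(x, y_other) for the linear feature map."""
    dim = len(X[0])
    return [sum((y[i] - y_other[i]) * X[i][k] for i in range(len(y))) for k in range(dim)]


X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 0.5]]
y = [1, -1, 1, -1]
working_set = [[1, 1, 1, -1], [-1, -1, 1, -1], [1, -1, -1, 1]]
y_new = [-1, 1, 1, -1]
row, kernel_calls = new_gram_row(y, y_new, working_set, X, linear_kernel)
```

With $m = 3$ constraints and $n = 4$ examples the sketch performs $m n^2 = 48$ kernel evaluations for a single new row, which is the per-iteration cost discussed above.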

3. Classifier Adaptation for Performance Measures 

In this section, we introduce our proposed approach CAPO, which is short for Classifier Adap- 
tation for Performance measures Optimization. 

3.1. Motivation and Basic Idea 

It is generally not straightforward to design learning algorithms that optimize a specific
performance measure, while there are many well-developed learning algorithms
in the literature, some of which can train complex nonlinear classifiers quite efficiently. The
intuitive motivation of this work is to exploit these existing algorithms to help train the needed
classifier that optimizes the concerned performance measure.

Specifically, denote by $f^*(\mathbf{x})$ the ideal classifier which minimizes the empirical risk $\Delta(f; D)$. It is
generally not easy to design algorithms which can efficiently find $f^*(\mathbf{x})$ in the function space by
minimizing $\Delta(f; D)$ due to its nonlinear and nonsmooth nature, especially when we are interested
in complex nonlinear classifiers. Meanwhile, by using off-the-shelf learning algorithms, we
can get a certain classifier $f'(\mathbf{x})$ quite efficiently, even on large-scale data sets. Obviously, $f'(\mathbf{x})$ can
differ from the ideal classifier $f^*(\mathbf{x})$, since it may optimize a different loss from $\Delta(f; D)$. However,
since many performance measures are closely related (for example, both F1-score and PRBEP
are functions of precision and recall, and the average AUC is an increasing function of accuracy),
$f'(\mathbf{x})$ can be regarded as a rough estimate of $f^*(\mathbf{x})$. We therefore conjecture that $f'(\mathbf{x})$ will
be helpful for finding $f^*(\mathbf{x})$ in the function space; for example, it can reduce the computational
cost of searching the whole function space. Subsequently, $f'(\mathbf{x})$ is called the auxiliary classifier
and $f^*(\mathbf{x})$ the target classifier.



To implement this motivation, we adopt classifier adaptation techniques [21, 30], which have
achieved successes in domain adaptation [9]. Specifically, after getting the auxiliary classifier
$f'(\mathbf{x})$, we adapt it to a new classifier $f(\mathbf{x})$, and it is expected that the adapted classifier $f(\mathbf{x})$
can achieve good performance in terms of the concerned performance measure. For the classifier
adaptation procedure, it is expected that

• The adapted classifier outperforms the auxiliary classifier in terms of the concerned performance
measure;

• The adaptation procedure is more efficient than directly training a new classifier for the
concerned performance measure;

• The adaptation framework can handle different types of auxiliary classifiers and different
performance measures.

Since many existing algorithms can train auxiliary classifiers efficiently, we focus on the classifier 
adaptation procedure in the remainder of the paper. 

3.2. Classifier Adaptation Procedure 

For the aim of this work, we study the classifier adaptation problem under the function-level
adaptation framework, which was originally proposed for domain adaptation in [30, 29].



3.2.1. Single Auxiliary Classifier 

The basic idea is to directly modify the decision function of the auxiliary classifier, which can be of
any type. Concretely, given one auxiliary classifier $f'(\mathbf{x})$, we construct the new classifier $f(\mathbf{x})$ by
adding a delta function $f_\delta(\mathbf{x}) = \mathbf{w}^T \Phi(\mathbf{x})$, i.e.,

$$f(\mathbf{x}) = \mathrm{sign}\big[f'(\mathbf{x}) + \mathbf{w}^T \Phi(\mathbf{x})\big],$$

where $\mathbf{w}$ is the parameter of $f_\delta(\mathbf{x})$, and $\Phi(\cdot)$ is a feature mapping. It should be noted that $f'(\mathbf{x})$
is the auxiliary classifier directly producing +1/-1 predictions, and it can be of any type (e.g.,
SVM, neural network, decision tree, etc.) because it is treated as a "black box" in CAPO; while
$f_\delta(\mathbf{x})$ is a real-valued function, which is added to modify the decision of $f'(\mathbf{x})$ such that $f(\mathbf{x})$ can
achieve good performance in terms of the concerned performance measure. Obviously, our task
is reduced to learning the delta function $f_\delta(\mathbf{x})$, and hence the classifier $f(\mathbf{x})$.
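A minimal sketch of the adapted classifier is given below, assuming a linear delta function (the feature map is the identity) and breaking ties at zero in favour of +1. The auxiliary classifier `aux` and the weight vector are hypothetical toy values, not learnt ones.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def make_adapted_classifier(aux_predict: Callable[[Vector], int],
                            w: Vector) -> Callable[[Vector], int]:
    """Build f(x) = sign[f'(x) + w^T x] from a black-box auxiliary classifier."""
    def f(x: Vector) -> int:
        delta = sum(wi * xi for wi, xi in zip(w, x))  # linear delta function
        return 1 if aux_predict(x) + delta >= 0 else -1
    return f


def aux(x: Vector) -> int:
    """Hypothetical auxiliary classifier: thresholds the first feature."""
    return 1 if x[0] > 0 else -1


adapted = make_adapted_classifier(aux, [0.0, -3.0])
```

Because the auxiliary output is only +1 or -1, the delta function overrides it exactly when its magnitude exceeds 1 with the opposite sign, as for the input `[1.0, 1.0]` in this toy setting.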

Based on the principle of regularized risk minimization, we consider the problem

$$\min_{\mathbf{w}}\ \Omega(\mathbf{w}) + C \cdot \Delta(\bar{y}, \bar{y}^*), \tag{6}$$

where $\Omega(\mathbf{w})$ is a regularization term, $\Delta(\bar{y}, \bar{y}^*)$ is the empirical risk on training data $D$, with
$\bar{y} = (y_1, \ldots, y_n)$ the true class labels and $\bar{y}^* = (f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n))$ the predictions of $f(\mathbf{x})$,
and $C$ is the regularization parameter. In practice, problem (6) is not easy to solve, mainly
due to the following two issues:

1. For some multivariate performance measures like F1-score, the empirical risk $\Delta$ cannot
be decomposed over individual predictions, i.e., it cannot be written in the form
$\Delta(\bar{y}, \bar{y}') = \sum_{i=1}^{n} \ell(y_i, f(\mathbf{x}_i))$;

2. The empirical risk $\Delta$ can be nonconvex and nonsmooth.

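To cope with these issues, inspired by SVM^perf [15], we take the multivariate prediction
formulation. That is, instead of learning $f(\mathbf{x}): \mathcal{X} \mapsto \mathcal{Y}$ directly, we consider $\bar{f}: \mathcal{X}^n \mapsto \mathcal{Y}^n$, which maps a
tuple of $n$ patterns $\bar{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ to $n$ class labels $\bar{y} = (y_1, \ldots, y_n)$. Specifically, the mapping
is implemented by maximizing a discriminant function $F(\bar{x}, \bar{y})$, i.e.,

$$\bar{y} = \arg\max_{\bar{y}' \in \mathcal{Y}^n} F(\bar{x}, \bar{y}'). \tag{7}$$

In this work, $F(\bar{x}, \bar{y}) = \sum_{i=1}^{n} y_i f(\mathbf{x}_i)$ is used, so the argmax in (7) can be easily obtained by
assigning $y'_i$ with $f(\mathbf{x}_i)$. In this way, (7) becomes

$$\bar{y} = \arg\max_{\bar{y}' \in \mathcal{Y}^n} \begin{bmatrix} 1 \\ \mathbf{w} \end{bmatrix}^T \Upsilon(\bar{x}, \bar{y}'), \quad \text{where} \quad \Upsilon(\bar{x}, \bar{y}') = \sum_{i=1}^{n} y'_i \begin{bmatrix} f'(\mathbf{x}_i) \\ \Phi(\mathbf{x}_i) \end{bmatrix}.$$

Instead of directly minimizing $\Delta(\bar{y}, \bar{y}^*)$, we consider its convex upper bound as follows.

Proposition 1 Given training data $D$ and the discriminant function $F(\bar{x}, \bar{y})$, the risk function

$$R(\mathbf{w}; D) = \max_{\bar{y}' \in \mathcal{Y}^n} [F(\bar{x}, \bar{y}') - F(\bar{x}, \bar{y}) + \Delta(\bar{y}, \bar{y}')] \tag{8}$$

is a convex upper bound of the empirical risk $\Delta(\bar{y}, \bar{y}^*)$ with $\bar{y}^* = \arg\max_{\bar{y}' \in \mathcal{Y}^n} F(\bar{x}, \bar{y}')$.

Proof: The convexity of (8) with respect to $\mathbf{w}$ is due to the fact that $F$ is linear in $\mathbf{w}$ and a
maximum of linear functions is convex. Since $\bar{y}^* = \arg\max_{\bar{y}' \in \mathcal{Y}^n} F(\bar{x}, \bar{y}')$, it follows that

$$R(\mathbf{w}; D) \ge F(\bar{x}, \bar{y}^*) - F(\bar{x}, \bar{y}) + \Delta(\bar{y}, \bar{y}^*) \ge \Delta(\bar{y}, \bar{y}^*),$$

where the second inequality holds because $F(\bar{x}, \bar{y}^*) \ge F(\bar{x}, \bar{y})$. Thus, $R(\mathbf{w}; D)$ is a convex upper bound. □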
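The upper-bound property of Proposition 1 can be checked numerically on small random problems. The sketch below is only a sanity check under stated assumptions: the loss is 1 - F1 (with F1 set to 0 when its denominator is zero), the real-valued scores stand for $f'(\mathbf{x}_i) + \mathbf{w}^T\Phi(\mathbf{x}_i)$ and are drawn at random, the maximum in (8) is found by exhaustive enumeration, and all names are hypothetical.

```python
import random
from itertools import product
from typing import Sequence


def f1_loss(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """1 - F1 for labels in {-1, +1}; F1 is assumed 0 if its denominator is 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    denom = 2 * tp + fp + fn
    return 1.0 - (2.0 * tp / denom if denom else 0.0)


def discriminant(scores: Sequence[float], labels: Sequence[int]) -> float:
    """F(x, y') = sum_i y'_i * score_i."""
    return sum(l * s for l, s in zip(labels, scores))


def risk_upper_bound(y: Sequence[int], scores: Sequence[float]) -> float:
    """R = max over y' of [F(x, y') - F(x, y) + loss(y, y')], by enumeration."""
    f_true = discriminant(scores, y)
    return max(discriminant(scores, cand) - f_true + f1_loss(y, cand)
               for cand in product((-1, 1), repeat=len(y)))


def check_bound(n: int = 6, trials: int = 50, seed: int = 0) -> bool:
    """Return True if R >= loss(y, y*) held in every random trial."""
    rng = random.Random(seed)
    for _ in range(trials):
        y = [rng.choice((-1, 1)) for _ in range(n)]
        scores = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        y_star = [1 if s >= 0 else -1 for s in scores]  # argmax of F
        if risk_upper_bound(y, scores) < f1_loss(y, y_star) - 1e-12:
            return False
    return True


bound_holds = check_bound()
```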

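Consequently, by taking $\Omega(\mathbf{w}) = \|\mathbf{w}\|^2$ and the convex upper bound $R(\mathbf{w}; D)$, problem (6)
becomes

$$\begin{aligned}
\min_{\mathbf{w}, \xi \ge 0}\ & \frac{1}{2}\|\mathbf{w}\|^2 + C\xi \\
\text{s.t.}\ & \forall \bar{y}' \in \mathcal{Y}^n \setminus \bar{y}:\ \begin{bmatrix} 1 \\ \mathbf{w} \end{bmatrix}^T [\Upsilon(\bar{x}, \bar{y}) - \Upsilon(\bar{x}, \bar{y}')] \ge \Delta(\bar{y}, \bar{y}') - \xi,
\end{aligned} \tag{9}$$

where $\xi$ is a slack variable introduced to hide the max in (8).

Although the regularization term $\|\mathbf{w}\|^2$ has the same form as that of SVM^perf in (2), it has a
different meaning, as stated in the following proposition.

Proposition 2 By minimizing the regularization term $\|\mathbf{w}\|^2$ in (9), the adapted classifier $f(\mathbf{x})$
is made to be near the auxiliary classifier $f'(\mathbf{x})$ in the reproducing kernel Hilbert space.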

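Proof: The Lagrangian function of (9) is

$$L = \frac{1}{2}\|\mathbf{w}\|^2 + \Big(C - \gamma - \sum_{\bar{y}' \in \mathcal{Y}^n \setminus \bar{y}} \alpha_{\bar{y}'}\Big)\xi - \sum_{\bar{y}' \in \mathcal{Y}^n \setminus \bar{y}} \alpha_{\bar{y}'} \Big( \begin{bmatrix} 1 \\ \mathbf{w} \end{bmatrix}^T [\Upsilon(\bar{x}, \bar{y}) - \Upsilon(\bar{x}, \bar{y}')] - \Delta(\bar{y}, \bar{y}') \Big),$$

where $\alpha_{\bar{y}'}$ and $\gamma$ are Lagrangian multipliers. By setting the derivative of $L$ with respect to $\mathbf{w}$ to
zero, we obtain

$$\mathbf{w} = \sum_{i=1}^{n} \beta_i \Phi(\mathbf{x}_i) \quad \text{and} \quad f_\delta(\mathbf{x}) = \sum_{i=1}^{n} \beta_i K(\mathbf{x}_i, \mathbf{x}),$$

where $\beta_i = \sum_{\bar{y}' \in \mathcal{Y}^n \setminus \bar{y}} \alpha_{\bar{y}'} (y_i - y'_i)$ and $K(\mathbf{x}_i, \mathbf{x}) = \Phi(\mathbf{x}_i)^T \Phi(\mathbf{x})$.

Since $f(\mathbf{x}) = f'(\mathbf{x}) + f_\delta(\mathbf{x})$, the distance between $f$ and $f'$ in the RKHS is

$$\|f - f'\|^2 = \|f_\delta\|^2 = \langle f_\delta, f_\delta \rangle = \sum_{i=1}^{n} \sum_{j=1}^{n} \beta_i \beta_j K(\mathbf{x}_i, \mathbf{x}_j).$$

Meanwhile, since $\mathbf{w} = \sum_{i=1}^{n} \beta_i \Phi(\mathbf{x}_i)$, we have

$$\|\mathbf{w}\|^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} \beta_i \beta_j \Phi(\mathbf{x}_i)^T \Phi(\mathbf{x}_j).$$

By computing $\Phi(\mathbf{x}_i)^T \Phi(\mathbf{x}_j)$ via the kernel $K(\mathbf{x}_i, \mathbf{x}_j)$, we obtain $\|f - f'\|^2 = \|\mathbf{w}\|^2$, which
completes the proof. □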
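The identity used at the end of the proof can be verified numerically for a linear kernel, where the feature map is the identity and $\mathbf{w}$ can be formed explicitly. The data points and coefficients below are arbitrary illustrative values, and the helper names are hypothetical.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product, used here as the linear kernel K(a, b)."""
    return sum(p * q for p, q in zip(a, b))


def weight_vector(beta: Sequence[float], X: Sequence[Sequence[float]]) -> List[float]:
    """w = sum_i beta_i * x_i for the identity feature map."""
    dim = len(X[0])
    return [sum(b * x[k] for b, x in zip(beta, X)) for k in range(dim)]


def rkhs_norm_sq(beta: Sequence[float], X: Sequence[Sequence[float]]) -> float:
    """||f_delta||^2 = sum_i sum_j beta_i beta_j K(x_i, x_j)."""
    n = len(X)
    return sum(beta[i] * beta[j] * dot(X[i], X[j]) for i in range(n) for j in range(n))


def f_delta(x: Sequence[float], beta: Sequence[float], X: Sequence[Sequence[float]]) -> float:
    """Kernel expansion f_delta(x) = sum_i beta_i K(x_i, x)."""
    return sum(b * dot(xi, x) for b, xi in zip(beta, X))


X = [[1.0, 2.0], [0.5, -1.0], [-2.0, 0.0]]
beta = [0.4, -1.2, 0.7]
w = weight_vector(beta, X)
norm_w_sq = dot(w, w)
norm_f_delta_sq = rkhs_norm_sq(beta, X)
```

Both quantities agree up to floating-point error, and the kernel expansion of the delta function coincides with $\mathbf{w}^T\mathbf{x}$ for any test point.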



In summary, by solving problem (9), CAPO finds the adapted classifier $f(\mathbf{x})$ near the auxiliary
classifier $f'(\mathbf{x})$ such that $f(\mathbf{x})$ minimizes an upper bound of the empirical risk, and the parameter
$C$ balances these two goals.



3.2.2. Multiple Auxiliary Classifiers 

If there are multiple auxiliary classifiers, rather than choosing one, we learn the target classifier
by leveraging all of them. A straightforward idea is to construct an ensemble
of them and treat the ensemble as a single classifier to be adapted. Suppose we have $m$
auxiliary classifiers $f^1(\mathbf{x}), \ldots, f^m(\mathbf{x})$; the target classifier $f(\mathbf{x})$ can be formulated as

$$f(\mathbf{x}) = \mathrm{sign}\Big[\sum_{i=1}^{m} a_i f^i(\mathbf{x}) + \mathbf{w}^T \Phi(\mathbf{x})\Big], \tag{10}$$

where $a_i$ is the weight of the auxiliary classifier $f^i(\mathbf{x})$, and $f_\delta(\mathbf{x}) = \mathbf{w}^T \Phi(\mathbf{x})$ is the delta function
as above. We learn the ensemble weights $\mathbf{a} = [a_1, \ldots, a_m]^T$ and the parameter $\mathbf{w}$ of $f_\delta(\mathbf{x})$
simultaneously. Let $\mathbf{f}_i = [f^1(\mathbf{x}_i), \ldots, f^m(\mathbf{x}_i)]^T$ and

$$\Psi(\bar{x}, \bar{y}) = \sum_{i=1}^{n} y_i \begin{bmatrix} \mathbf{f}_i \\ \Phi(\mathbf{x}_i) \end{bmatrix}.$$

Following the same strategy as above, the following problem is formulated:

$$\begin{aligned}
\min_{\mathbf{a}, \mathbf{w}, \xi \ge 0}\ & \frac{1}{2}\|\mathbf{w}\|^2 + \frac{B}{2}\|\mathbf{a}\|^2 + C\xi \\
\text{s.t.}\ & \forall \bar{y}' \in \mathcal{Y}^n \setminus \bar{y}:\ \begin{bmatrix} \mathbf{a} \\ \mathbf{w} \end{bmatrix}^T [\Psi(\bar{x}, \bar{y}) - \Psi(\bar{x}, \bar{y}')] \ge \Delta(\bar{y}, \bar{y}') - \xi,
\end{aligned} \tag{11}$$

where $\|\mathbf{a}\|^2$ penalizes large weights on the auxiliary classifiers. It prevents the target classifier $f(\mathbf{x})$
from relying too much on the auxiliary classifiers, because they do not directly optimize the target
performance measure. The term $\|\mathbf{w}\|^2$ measures the distance between $f(\mathbf{x})$ and $\sum_{i=1}^{m} a_i f^i(\mathbf{x})$ in
the function space. Thus, minimizing $\frac{1}{2}\|\mathbf{w}\|^2$ finds the final classifier $f(\mathbf{x})$ near the ensemble
of auxiliary classifiers $\sum_{i=1}^{m} a_i f^i(\mathbf{x})$ in the function space. The two goals are balanced by the
parameter $B$. Hence, in summary, it learns an ensemble of auxiliary classifiers, and seeks the
target classifier near the ensemble such that the risk in terms of the concerned performance measure
is minimized.



3.2.3. Efficient Learning via Feature Augmentation 



Obviously, in CAPO, the auxiliary classifier $f'(\mathbf{x})$ can be a nonlinear classifier such as an SVM or a
neural network; thus the adapted classifier $f(\mathbf{x})$ is nonlinear even if the delta function $f_\delta(\mathbf{x})$ is
linear. Empirical studies in Section 5 show that using a linear delta function $f_\delta(\mathbf{x})$ achieves good
performance whilst keeping computational efficiency.

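Consider a linear delta function, i.e., $\Phi(\mathbf{x}) = \mathbf{x}$ and $f_\delta(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$, and take CAPO with multiple
auxiliary classifiers as an example. If we augment the original features with the outputs of the
auxiliary classifiers, and let

$$\mathbf{v} = \begin{bmatrix} \sqrt{B}\,\mathbf{a} \\ \mathbf{w} \end{bmatrix} \quad \text{and} \quad \mathbf{x}'_i = \begin{bmatrix} \frac{1}{\sqrt{B}}\mathbf{f}_i \\ \mathbf{x}_i \end{bmatrix}, \tag{12}$$

the adaptation problem (11) can be written as

$$\begin{aligned}
\min_{\mathbf{v}, \xi \ge 0}\ & \frac{1}{2}\|\mathbf{v}\|^2 + C\xi \\
\text{s.t.}\ & \forall \bar{y}' \in \mathcal{Y}^n \setminus \bar{y}:\ \mathbf{v}^T \Big[\sum_{i=1}^{n} y_i \mathbf{x}'_i - \sum_{i=1}^{n} y'_i \mathbf{x}'_i\Big] \ge \Delta(\bar{y}, \bar{y}') - \xi.
\end{aligned} \tag{13}$$

For CAPO with one auxiliary classifier, it is easy to find that there exists a constant $B$ such that
the adaptation problem (9) can also be transformed into problem (13) if we define

$$\mathbf{v} = \begin{bmatrix} \sqrt{B} \\ \mathbf{w} \end{bmatrix} \quad \text{and} \quad \mathbf{x}'_i = \begin{bmatrix} \frac{1}{\sqrt{B}} f'(\mathbf{x}_i) \\ \mathbf{x}_i \end{bmatrix}. \tag{14}$$

Note that problem (13) is the same as that of linear SVM^perf in (2). Thus, after obtaining
auxiliary classifiers, if we augment the original data features with the outputs of the auxiliary
classifiers according to (12) or (14), the classifier adaptation problem of CAPO can be efficiently
solved by the cutting-plane algorithm in Algorithm 1. Obviously, like linear SVM^perf, CAPO can
also handle all the performance measures based on the contingency table and AUC.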
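The feature augmentation in (12) can be sketched as below for a linear delta function. The sketch checks the two facts that make the reduction work: the augmented score equals the original score, and the augmented squared norm equals the combined regularizer up to the common factor 1/2. Function names and the toy numbers are hypothetical.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


def augment_features(x: Sequence[float], aux_outputs: Sequence[float], B: float) -> List[float]:
    """x' = [f / sqrt(B); x]: prepend the scaled auxiliary outputs to x."""
    return [f / math.sqrt(B) for f in aux_outputs] + list(x)


def stack_parameters(a: Sequence[float], w: Sequence[float], B: float) -> List[float]:
    """v = [sqrt(B) * a; w]: stack the scaled ensemble weights and w."""
    return [math.sqrt(B) * ai for ai in a] + list(w)


a = [0.6, -0.2, 0.9]        # ensemble weights for three auxiliary classifiers
w = [0.5, -1.5]             # parameter of the linear delta function
B = 4.0
x = [2.0, 1.0]              # one input pattern
f_out = [1, -1, 1]          # +1/-1 outputs of the auxiliary classifiers on x

v = stack_parameters(a, w, B)
x_aug = augment_features(x, f_out, B)

score_original = dot(a, f_out) + dot(w, x)   # a^T f + w^T x
score_augmented = dot(v, x_aug)              # v^T x'
reg_original = B * dot(a, a) + dot(w, w)     # B ||a||^2 + ||w||^2
reg_augmented = dot(v, v)                    # ||v||^2
```

Since both the scores and the regularizer are preserved, any solver for the linear problem can in principle be run on the augmented features without modification.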

In practice, CAPO is an efficient approach for training nonlinear classifiers optimizing specific
performance measures, because both of its steps can be efficiently performed. Moreover, because
auxiliary classifiers can be seen as estimates of the needed classifier, it can be expected that
Algorithm 1 needs fewer iterations to converge, i.e., fewer times of solving the inference (3), and
hence the classifier adaptation procedure can be more efficient than linear SVM^perf, which searches
the function space directly. This has been validated by the experimental results in Section 5.2.



4. Discussion with Related Work 

The most famous work that optimizes performance measures is SVM^perf [15]. By taking a
multivariate prediction formulation, it finds the classifier in the function space directly. Our proposed
CAPO works in a different manner and employs auxiliary classifiers to help find the target
classifier in the function space. Furthermore, CAPO is a framework that can use different types of
auxiliary classifiers. If a nonlinear auxiliary classifier is used, the obtained classifier will also be
nonlinear. This is very helpful, because nonlinear classifiers are preferred in many applications while
training nonlinear SVM^perf is computationally expensive. In summary, compared with SVM^perf,
CAPO can provide the needed nonlinearity whilst keeping or even improving computational
efficiency.



Another related work is A-SVM [30], which learns a new SVM classifier by adapting auxiliary
classifiers trained in other related domains. CAPO differs from A-SVM in several aspects: 1)
CAPO aims to optimize specific performance measures, while A-SVM considers hinge loss; 2)
the auxiliary classifiers of CAPO are used to help find the target classifier in the function space,
while A-SVM is proposed for domain adaptation [9] and employs auxiliary classifiers to extract
knowledge from related domains; similar ideas can be found in [10]. Generally speaking, classifier
adaptation techniques, which try to obtain a new classifier based on existing classifiers, were
mainly used for domain adaptation in previous studies. Here, we use classifier adaptation
to optimize specific performance measures, which is quite different.



Ensemble learning is the learning paradigm which employs multiple learners to solve one task,
and it achieves state-of-the-art performance in many practical applications. In the current work, the
final classifier generated by CAPO is an ensemble consisting of auxiliary classifiers and the
delta function. But, different from conventional ensemble methods, the component classifiers of
CAPO are of two kinds and generated in two steps: first, auxiliary classifiers are trained; then a
delta function, which is designed to correct the decisions of the auxiliary classifiers, is added such that
the concerned performance measure is optimized.

From the feature augmentation perspective, the nonlinear auxiliary classifiers construct nonlinear
features that are augmented to the original features, so that the final classifier can have nonlinear
generalization performance. This is like constructive induction [22], which tries to change the
representation of data by creating new features.



Curriculum learning [2] is a learning paradigm which circumvents a challenging learning task by
starting with relatively easier subtasks; then, with the help of the learnt subtasks, the target task
can be effectively solved. It was first proposed for training neural networks in [11], and is closely
related to the idea of "twice learning" proposed in [34], where a neural network ensemble was
trained to help induce a decision tree. The study in [2] shows promising empirical results of
curriculum learning. Our proposed CAPO is similar to curriculum learning since it also tries to
solve a difficult problem by starting with relatively easier subtasks, but the two are quite different
because we do not provide a curriculum learning strategy.



5. Empirical Studies 



In this section, we perform experiments to evaluate the performance and efficiency of CAPO. 



5.1. Configuration

The following five data sets from different application domains are used in our experiments.

Table 1: Data sets used in the experiments.

Data set    #Feature    #Train     #Test
Ijcnn1            22    49,990    91,701
Mitfaces         361     6,977    24,045
Reuters        8,315     7,770     3,299
Splice            60     1,000     2,175
Usps*            256     7,291     2,007


• Ijcnn1: This data set is from the IJCNN 2001 neural network competition (task 1); here we
use the winner's transformation in [7].

• Mitfaces: Face detection data set from CBCL at MIT [1].

• Reuters: Text classification data, where the task is to discriminate the money-fx documents from
others in the Reuters-21578 collection.

• Splice: The task is to recognize two classes of splice junctions in a DNA sequence.

• Usps*: The task is to classify the digits "01234" against the digits "56789" on the
USPS handwritten digit recognition data.

Table 1 summarizes the information of the data sets. On each data set, we optimize 4 performance
measures (accuracy, F1-score, PRBEP and AUC), so there are 20 tasks in total. For each task,
we train classifiers on training examples, and then evaluate their performance on test examples.
The experiments are run on an Intel Xeon E5520 machine with 8GB memory.

5.2. Comparison with State-of-the-art Methods 

First, we compare the performance and efficiency of CAPO with state-of-the-art methods.
Specifically, we compare three methods which can optimize different performance measures:
SVM^perf, classification SVM incorporating a cost model, and our proposed CAPO.
Detailed implementations of these methods are described as follows.

• CAPO: We use three kinds of classifiers as auxiliary classifiers, including Core Vector
Machine (CVM)¹, RBF Neural Network (NN) and C4.5 Decision Tree (DT) [25], and the
corresponding CAPO's are denoted as CAPO_cvm, CAPO_nn and CAPO_dt, respectively. In
CAPO_cvm, the CVM is with RBF kernel $k(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma\|\mathbf{x}_i - \mathbf{x}_j\|^2)$, where $\gamma$ is set to the
default value (inverse squared averaged distance between examples), and the parameter $C$
is set to 1. In CAPO_nn and CAPO_dt, NN and DT are implemented by WEKA [13] with
default parameters. Furthermore, we also implement CAPO*, which exploits all three
auxiliary classifiers. The parameter $C$ is selected from $C \in \{2^{-7}, \ldots, 2^{7}\}$ by 5-fold cross
validation on training data, and the parameter $B$ of CAPO* is simply set to 1.

• SVM^perf: We use the code of SVM^perf provided by Joachims². Both linear kernel and
RBF kernel are used; the corresponding methods are denoted as SVM^perf_lin and SVM^perf_rbf,
respectively. The parameter $C$ for both methods and the kernel width $\gamma$ for SVM^perf_rbf are
selected from $C \in \{2^{-7}, \ldots, 2^{7}\}$ and $\gamma \in \{2^{-2}\gamma_0, \ldots, 2^{2}\gamma_0\}$ by 5-fold cross validation on
training data, where $\gamma_0$ is the inverse squared averaged distance between examples.

• SVM with cost model: We implement the SVM with cost model with SVM^light ³, where
the parameter $j$ is used to set different costs for different classes. Specifically, we use
SVM^light_lin and SVM^light_rbf, where linear kernel and RBF kernel are used. The parameters $C$ and
$j$ for both methods and the kernel width $\gamma$ for SVM^light_rbf are selected from $C \in \{2^{-7}, \ldots, 2^{7}\}$,
$j \in \{2^{-2}, \ldots, 2^{6}\}$ and $\gamma \in \{2^{-2}\gamma_0, \ldots, 2^{2}\gamma_0\}$ by 5-fold cross validation.

1 http://www.cs.ust.hk/~ivor/cvm.html. Here, we use the option "-c 1 -e 0.001" for all auxiliary CVMs.
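The parameter grids described above can be generated as in the following sketch. It assumes one particular reading of "inverse squared averaged distance between examples", namely one over the square of the mean pairwise Euclidean distance; the paper does not spell out the formula, so `default_gamma` is a hypothetical helper, and the toy data are illustrative only. `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import combinations
from typing import List, Sequence


def default_gamma(X: Sequence[Sequence[float]]) -> float:
    """Assumed default width: 1 / (mean pairwise Euclidean distance)^2."""
    dists = [math.dist(p, q) for p, q in combinations(X, 2)]
    mean_dist = sum(dists) / len(dists)
    return 1.0 / mean_dist ** 2


def power_of_two_grid(lo: int, hi: int, scale: float = 1.0) -> List[float]:
    """Return [scale * 2^lo, ..., scale * 2^hi]."""
    return [scale * 2.0 ** k for k in range(lo, hi + 1)]


def rbf_kernel(p: Sequence[float], q: Sequence[float], gamma: float) -> float:
    """k(p, q) = exp(-gamma * ||p - q||^2)."""
    return math.exp(-gamma * math.dist(p, q) ** 2)


X = [[0.0, 0.0], [3.0, 4.0]]                        # toy data: one pair at distance 5
gamma0 = default_gamma(X)                            # 1 / 5^2
C_grid = power_of_two_grid(-7, 7)                    # candidates for C
gamma_grid = power_of_two_grid(-2, 2, scale=gamma0)  # candidates for the kernel width
j_grid = power_of_two_grid(-2, 6)                    # candidates for the cost factor j
```

Each candidate would then be scored by 5-fold cross validation on the training data, which is omitted here.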



For parameter selection, we extend the search space if the most frequently selected parameter 
was on a boundary. Note that both SVM with cost model and SVM^perf are strong baselines to 
compare against: Lewis [20] won the TREC-2001 batch filtering evaluation by using the former, 
and Joachims [15] showed that SVM^perf performed better. We apply these methods to the 20 
tasks mentioned above and report their performance. Since time efficiency is also of concern, we 
report the CPU time used for parameter selection. If a task is not completed within 24 
hours, we stop it and mark it with "N/A". 
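
The boundary rule for parameter selection can be illustrated with a short sketch. The text does not say by how much the search space is extended, so the `step` argument below is an assumption, as is the function name `extend_if_on_boundary`; grids are represented by their base-2 exponents.

```python
from collections import Counter


def extend_if_on_boundary(exponents, selected, step=2):
    """Extend a sorted exponent grid if the most often selected value is an endpoint.

    exponents: sorted list of integer exponents, e.g. -7..7 for C in {2^-7, ..., 2^7}
    selected:  exponents chosen across runs (one entry per run)
    step:      number of extra exponents to add (assumed, not given in the text)
    """
    most_common = Counter(selected).most_common(1)[0][0]
    lo, hi = exponents[0], exponents[-1]
    if most_common == lo:
        return list(range(lo - step, lo)) + list(exponents)
    if most_common == hi:
        return list(exponents) + list(range(hi + 1, hi + step + 1))
    return list(exponents)


grid = list(range(-7, 8))
print(extend_if_on_boundary(grid, [7, 7, 3, 7, 5])[-3:])  # [7, 8, 9]
```

After an extension, the cross validation would be rerun on the new candidates only, and the check repeated until the selected value lies strictly inside the grid.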

Table 2 presents the performance of the compared methods as well as the raw performance of the
auxiliary classifiers (in brackets following the entries of the corresponding CAPO methods), where
the best result for each task is bolded. It is clear that the CAPO methods and SVM^perf_lin manage to finish 

2 http://svmlight.joachims.org/svm_perf.html.
3 http://svmlight.joachims.org.

Table 2: Performance of compared methods, where the best performance for each task is bolded and the methods 
that cannot be completed in 24 hours are indicated by "N/A". For CAPO, the raw performance of auxiliary 
classifier is shown in brackets following the entry of corresponding CAPO. 

Data set  Measure   CAPO_cvm       CAPO_dt        CAPO_nn        CAPO*   SVM^perf_lin  SVM^perf_rbf  SVM^light_lin  SVM^light_rbf
IJCNN1    Accuracy  .9540 (.9521)  .9702 (.9702)  .9150 (.8914)  .9703   .9193         .9658         N/A            N/A
IJCNN1    F1        .7620 (.7544)  .8473 (.8471)  .5753 (.2643)  .8468   .5565         N/A           N/A            N/A
IJCNN1    PRBEP     .7723 (.7376)  .8470 (.8364)  .5692 (.3222)  .8605   .6016         N/A           N/A            N/A
IJCNN1    AUC       .9607 (.8839)  .9734 (.9464)  .9198 (.8658)  .9810   .9180         N/A           N/A            N/A

Mitfaces  Accuracy  .9842 (.9839)  .9458 (.9302)  .9696 (.9067)  .9841   .9727         .9840         .9733          N/A
Mitfaces  F1        .4658 (.4665)  .1605 (.1342)  .2281 (.1768)  .4514   .2056         N/A           .2015          N/A
Mitfaces  PRBEP     .5127 (.4979)  .1864 (.1822)  .2500 (.1059)  .4873   .2140         N/A           .2309          N/A
Mitfaces  AUC       .9148 (.9148)  .7991 (.7201)  .8368 (.7979)  .9137   .8533         N/A           .8450          N/A

Reuters   Accuracy  .9745 (.9745)  .9664 (.9660)  .9715 (.9315)  .9739   .9727         .9727         .9724          .9721
Reuters   F1        .7730 (.7729)  .6973 (.6890)  .7455 (.1439)  .7731   .7375         N/A           .7599          .7540
Reuters   PRBEP     .7654 (.7709)  .7207 (.6871)  .7151 (.3743)  .7765   .7598         N/A           .7709          .7598
Reuters   AUC       .9870 (.9363)  .9842 (.9144)  .9868 (.8322)  .9838   .9878         N/A           .9872          .9873

Splice    Accuracy  .8947 (.8947)  .9347 (.9347)  .9651 (.9651)  .9664   .8451         .8947         .8446          .8975
Splice    F1        .8955 (.8943)  .9371 (.9362)  .9659 (.9659)  .9512   .8451         N/A           .8487          .8990
Splice    PRBEP     .8762 (.8691)  .9363 (.9355)  .9576 (.9558)  .9584   .8532         N/A           .8523          .9036
Splice    AUC       .9457 (.8992)  .9760 (.9307)  .9836 (.9667)  .9852   .9304         N/A           .9267          .9639

USPS*     Accuracy  .9691 (.9689)  .9233 (.9233)  .8520 (.7798)  .9676   .8411         .9706         N/A            N/A
USPS*     F1        .9611 (.9613)  .9060 (.9053)  .8188 (.7486)  .9617   .8012         N/A           N/A            N/A
USPS*     PRBEP     .9500 (.9488)  .9000 (.8898)  .8195 (.7500)  .9573   .7963         N/A           N/A            N/A
USPS*     AUC       .9731 (.9658)  .9557 (.9179)  .9137 (.7582)  .9843   .9052         N/A           N/A            N/A


all tasks in 24 hours. We can observe that CAPO achieves performance improvements over the
auxiliary classifiers on most tasks, and many of the improvements are quite large. 
For example, on Reuters the best AUC achieved by the auxiliary classifiers is 0.9363, while the CAPO 
methods achieve AUC higher than 0.98. This result shows that CAPO is effective in improving 
the performance with respect to the concerned performance measure. More results for the case of 
multiple auxiliary classifiers are given in Section 5.3. Moreover, we can see from the results that 
the CAPO methods perform much better than the linear methods, i.e., SVM^perf_lin and SVM^light_lin,
especially when optimizing multivariate performance measures like F1-score and PRBEP. For example, 
CAPO* achieves PRBEP 0.8605 but SVM^perf_lin achieves only 0.6016 on IJCNN1; CAPO_nn achieves 
F1-score 0.9659, but those of SVM^perf_lin and SVM^light_lin are both less than 0.85 on Splice. This
can be explained by the fact that the CAPO methods exploit the nonlinearity provided by the auxiliary
classifiers. 
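
For readers who want to recompute the four measures used in the tables, the following is a minimal sketch using their standard definitions, not the evaluation code used in the experiments. Labels are assumed to be +1/-1, PRBEP is computed by labeling the top-ranked examples positive until the number of predicted positives equals the number of true positives (ties in scores are broken arbitrarily), and all function names are hypothetical.

```python
def contingency(y_true, y_pred):
    """Return (tp, fp, fn, tn) for labels in {+1, -1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p != 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != 1 and p != 1)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = contingency(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(y_true, y_pred):
    tp, fp, fn, _ = contingency(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def prbep(y_true, scores):
    """Precision (= recall) when exactly n_pos examples are predicted positive."""
    n_pos = sum(1 for t in y_true if t == 1)
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    return sum(1 for _, t in ranked[:n_pos] if t == 1) / n_pos


def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t != 1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical data).
y_true = [1, 1, -1, 1, -1, -1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else -1 for s in scores]
```

Accuracy and F1 depend only on the contingency table of hard predictions, whereas PRBEP and AUC depend on the ranking induced by the scores, which is why they cannot be decomposed over individual examples.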
Table 3: CPU time for parameter selection (in seconds), where the tasks not completed in 24 hours are indicated 
by "N/A". For CAPO, the CPU time for training auxiliary classifiers is not counted; it is shown in Table 4. 
The SVM^light entries are given once per data set, because its four tasks share one cross validation process.

Data set  Measure   CAPO_cvm  CAPO_dt   CAPO_nn   CAPO*     SVM^perf_lin  SVM^perf_rbf  SVM^light_lin  SVM^light_rbf
IJCNN1    Accuracy  9.3       11.1      9.9       11.2      10.0          96.6          N/A            N/A
IJCNN1    F1        9,451.5   9,011.5   14,809.3  6,652.8   12,281.3      N/A
IJCNN1    PRBEP     1,507.9   1,033.3   2,276.2   1,005.1   2,034.0       N/A
IJCNN1    AUC       88.0      38.0      124.0     40.6      112.6         N/A

Mitfaces  Accuracy  9.5       11.2      23.7      9.0       27.2          27,089.3      6,114.7        N/A
Mitfaces  F1        465.6     802.5     1,211.5   379.0     1,189.4       N/A
Mitfaces  PRBEP     126.9     183.4     241.6     119.6     234.4         N/A
Mitfaces  AUC       37.7      48.5      74.0      30.6      79.3          N/A

Reuters   Accuracy  5.7       2.1       2.6       3.9       2.3           39,813.1      283.1          53,113.8
Reuters   F1        68.7      67.4      64.3      67.6      60.2          N/A
Reuters   PRBEP     10.8      13.1      11.9      10.6      11.4          N/A
Reuters   AUC       18.9      8.6       8.7       3.9       8.1           N/A

Splice    Accuracy  4.0       484.5     697.1     2.0       3,602.4       2,187.1       16,297.6       464.2
Splice    F1        168.2     592.3     3,373.9   58.4      10,201.5      N/A
Splice    PRBEP     11.8      17.0      27.3      6.8       82.6          N/A
Splice    AUC       2.0       3.3       7.3       1.2       42.0          N/A

USPS*     Accuracy  24.6      35.4      215.3     15.6      221.5         24,026.7      N/A            N/A
USPS*     F1        2,199.0   2,605.4   5,429.9   1,514.8   5,225.9       N/A
USPS*     PRBEP     626.2     566.1     938.9     404.4     895.2         N/A
USPS*     AUC       155.6     139.9     424.3     76.1      452.5         N/A


Table 4: CPU time for training auxiliary classifiers (in seconds). 

Data set   CVM    DT        NN
IJCNN1     1.6    19.9      20.2
Mitfaces   2.8    66.1      63.6
Reuters    2.1    1,689.7   1,771.0
Splice     0.1    0.4       0.9
USPS*      2.3    45.9      37.1


Meanwhile, it is interesting that all methods achieve similar performance on Reuters, which 
coincides with the common knowledge that linear classifiers are strong enough for text classification 
tasks. The kernelized methods, i.e., SVM^perf_rbf and SVM^light_rbf, fail to 
finish in 24 hours on most tasks. On the smallest data set Splice, SVM^light_rbf manages to finish 
all tasks, and its performance is better than that of the linear methods (SVM^perf_lin and
SVM^light_lin); this can be explained by the fact that SVM^light_rbf exploits nonlinearity by using
the RBF kernel. On this data set, however, the CAPO methods other than CAPO_cvm, and CAPO* in
particular, outperform SVM^light_rbf. A possible reason is that the RBF kernel may not be suitable
for this data, while CAPO* exploits nonlinearity introduced by different kinds of auxiliary
classifiers. By comparing CAPO* with the CAPO methods that use one auxiliary classifier, it can be
found that there are many cases where CAPO* performs better. This is not hard to understand,
because CAPO* exploits more nonlinearity by using different kinds of auxiliary classifiers. 

Figure 1: Number of inferences of the most violated constraints (#Inference) when training
SVM^perf_lin, CAPO_cvm and CAPO* on (a) USPS* and (b) Reuters, where the x-axis and y-axis show
the C values (as log_2 C) and #Inference, respectively. 

Table 3 shows the CPU time used for parameter selection via cross validation. On each data 
set, we employ the same auxiliary classifiers for the four different measures, so the time used for 
training auxiliary classifiers is identical across the four tasks on one data set; it is shown in
Table 4. Also, because the four tasks of SVM^light on one data set share the same cross validation
process, they have the same cross validation time. From Tables 3 and 4, we can see that the
kernelized nonlinear methods (SVM^perf_rbf and SVM^light_rbf) fail to finish in 24 hours on most
tasks. This is because the Gram matrix updating in SVM^perf_rbf costs much time, as described in
Section 2.2, and SVM^light_rbf has many parameters to tune. Meanwhile, it can be found that the CAPO
methods are generally more efficient than the others, even after adding the time used for training
auxiliary classifiers. 

Moreover, it is interesting to find that the classifier adaptation procedure of CAPO costs much 
less time than SVM^perf_lin except on Reuters, even though it employs the latter to solve the
adaptation problem. For example, when optimizing F1-score on Splice, CAPO* consumes only 58.4 
seconds for cross validation while SVM^perf_lin costs more than 10,000 seconds. To understand this 
phenomenon, we record the number of inferences of the most violated constraint (i.e., the number of
times the argmax problem is solved) when training SVM^perf_lin, CAPO_cvm and CAPO*. Concretely, on
two representative data sets, Reuters and USPS*, the numbers of inferences under different C values
are recorded, and Figure 1 shows the results. From Figure 1(a), we can find that on USPS*, CAPO* and 
CAPO_cvm need fewer inferences than SVM^perf_lin, especially when C is large. Since the training cost 
of Algorithm 1 is dominated by the inference, the high efficiency of CAPO* and CAPO_cvm is 
due to the smaller number of inferences. A possible explanation is that the auxiliary classifiers
provide estimates of the target classifier and CAPO searches around them, while SVM^perf_lin searches
in the whole function space. On Reuters, where the three methods have similar time efficiency, we can
find from Figure 1(b) that the numbers of inferences are small and similar. This is consistent with
linear classifiers being strong enough for text classification tasks. Moreover, the adaptation
procedure of CAPO* is more efficient than that of CAPO_cvm, and Figure 1(a) also shows that CAPO*
needs fewer inferences. This indicates that it may be easier to find the target classifier by using
multiple auxiliary classifiers, coinciding with the fact that an ensemble can provide a better
estimate of the target classifier. 

Therefore, we can see that the auxiliary classifiers not only inject nonlinearity, but also make the 
classifier adaptation procedure more efficient. 

5.3. Effect of Delta Function 

To show the effect of adding the delta function to the auxiliary classifiers, we compare the
performance of CAPO with that of the weighted ensemble of auxiliary classifiers, which does not
include a delta function. In detail, we train five CVMs as auxiliary classifiers because of their
high efficiency. Each CVM uses one of the following five kernels: 1) RBF kernel
$k(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$; 2) polynomial kernel
$k(x_i, x_j) = (\gamma x_i^\top x_j + c_0)^d$; 3) Laplacian kernel
$k(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|)$; 4) inverse distance kernel
$k(x_i, x_j) = \frac{1}{\sqrt{\gamma}\,\|x_i - x_j\| + 1}$; and 5) inverse squared distance kernel
$k(x_i, x_j) = \frac{1}{\gamma \|x_i - x_j\|^2 + 1}$, where all kernels use default parameters
($c_0 = 0$ and $d = 3$ in the polynomial kernel, and $\gamma$ is the inverse squared averaged
distance between examples in all kernels). 
Then, CAPO employs these five CVMs as auxiliary classifiers, and the weighted ensemble learns 
Table 5: Performance comparison between CAPO and weighted ensemble, where both methods exploit five CVMs 
with different kernels. 

Data set  Measure   CAPO   Ensemble
IJCNN1    Accuracy  .9712  .9632
IJCNN1    F1        .8438  .8439
IJCNN1    PRBEP     .8472  .8000
IJCNN1    AUC       .9892  .9837

Mitfaces  Accuracy  .9842  .9837
Mitfaces  F1        .4563  .4446
Mitfaces  PRBEP     .4831  .4767
Mitfaces  AUC       .9097  .9097

Reuters   Accuracy  .9715  .9715
Reuters   F1        .7429  .7181
Reuters   PRBEP     .7598  .7318
Reuters   AUC       .9847  .7979

Splice    Accuracy  .8952  .8938
Splice    F1        .9024  .9024
Splice    PRBEP     .9010  .8912
Splice    AUC       .9486  .9022

USPS*     Accuracy  .9706  .9701
USPS*     F1        .9659  .9644
USPS*     PRBEP     .9634  .9622
USPS*     AUC       .9823  .9705



a set of weights to combine them such that the empirical risk is minimized. Both methods select 
C from $\{2^{-7}, \ldots, 2^{7}\}$ by 5-fold cross validation on training data, and B of CAPO is fixed to 1. 
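
The five kernels used for the auxiliary CVMs can be written down as plain functions. This is a sketch for illustration only, not the CVM toolkit's implementation: the negative signs in the exponentials, the $\sqrt{\gamma}$ factor in the inverse distance kernel, and the polynomial defaults `c0 = 0` and `d = 3` are assumptions read from the kernel definitions in this section, and all function names are hypothetical.

```python
import math


def rbf(x, z, gamma):
    return math.exp(-gamma * math.dist(x, z) ** 2)


def polynomial(x, z, gamma, c0=0.0, d=3):
    dot = sum(a * b for a, b in zip(x, z))
    return (gamma * dot + c0) ** d


def laplacian(x, z, gamma):
    return math.exp(-gamma * math.dist(x, z))


def inverse_distance(x, z, gamma):
    return 1.0 / (math.sqrt(gamma) * math.dist(x, z) + 1.0)


def inverse_squared_distance(x, z, gamma):
    return 1.0 / (gamma * math.dist(x, z) ** 2 + 1.0)


# Toy evaluation (hypothetical points, distance 5, gamma = 0.04).
x, z, gamma = (0.0, 0.0), (3.0, 4.0), 0.04
values = {
    "rbf": rbf(x, z, gamma),
    "laplacian": laplacian(x, z, gamma),
    "inverse_distance": inverse_distance(x, z, gamma),
    "inverse_squared_distance": inverse_squared_distance(x, z, gamma),
}
```

All four distance-based kernels equal 1 when the two points coincide and decay as the distance grows, but at different rates, which is one way the five auxiliary classifiers end up being diverse.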

Table 5 presents the performance of the two methods. CAPO matches or outperforms the weighted
ensemble on every task except F1-score on IJCNN1, where the two are nearly identical (0.8438
versus 0.8439). For example, the weighted ensemble achieves PRBEP 0.8000 on IJCNN1 while CAPO
achieves 0.8472; the weighted ensemble achieves AUC 0.9022 but CAPO achieves 0.9486 on Splice.
Since the only difference between the two methods is that CAPO exploits the delta function, we can
see that adding the delta function improves performance w.r.t. the concerned performance measure. 
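
The difference between the two compared models can be made concrete with a small sketch. It assumes, as a simplification, that the weighted ensemble scores an example by a weighted sum of the auxiliary classifiers' outputs and that CAPO adds a linear delta function on the raw features to that sum; the names `ensemble_score`, `capo_score`, `a` and `w` are hypothetical, and the learning of the weights is not shown.

```python
def ensemble_score(aux_outputs, a):
    """Weighted combination of auxiliary classifier outputs (no delta function)."""
    return sum(a_i * f_i for a_i, f_i in zip(a, aux_outputs))


def capo_score(aux_outputs, a, x, w):
    """Ensemble score plus a linear delta function w . x (assumed form)."""
    delta = sum(w_j * x_j for w_j, x_j in zip(w, x))
    return ensemble_score(aux_outputs, a) + delta


# Toy example (hypothetical numbers): the delta term flips the predicted sign.
aux_outputs = [0.6, -0.2, 0.1]   # outputs of three auxiliary classifiers on one example
a = [0.5, 0.3, 0.2]              # combination weights
x = [1.0, -2.0]                  # raw features of the same example
w = [0.1, 0.4]                   # delta-function weights

s_ens = ensemble_score(aux_outputs, a)
s_capo = capo_score(aux_outputs, a, x, w)
```

With `w` set to zero the two scores coincide, so under this assumed form the ensemble is a special case of the adapted classifier.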

5.4. Effect of Auxiliary Classifier Selection 

Figure 2: Comparison between the relative improvement of the averaged performance of auxiliary
classifiers and the relative performance improvement of CAPO after auxiliary classifier selection. 

In the above experiments, we directly use common learning algorithms to train auxiliary classifiers, 
so these auxiliary classifiers are not specially improved according to the concerned 
performance measure. A natural question is then how CAPO performs if the auxiliary 
classifiers are specially improved w.r.t. the concerned performance measure, or in other words, how 
CAPO performs if we train auxiliary classifiers according to the concerned performance measure. 
We perform experiments to answer this question. Specifically, rather than training 
five CVMs with five different kernels with default parameters, we train a set of fifty CVMs and 
select five of them as auxiliary classifiers based on the concerned performance measure. In 
detail, these fifty CVMs are trained by independently using the five kernels mentioned above, with
the parameter $\gamma$ of each kernel set as $\gamma = 1.5^{\theta}\gamma_0$, where
$\theta \in \{-0.5, 0, 0.5, \ldots, 4\}$ and $\gamma_0$ is the default value; then the five CVMs
which perform best in terms of the concerned performance measure are selected as auxiliary
classifiers. For example, if we want to train a classifier optimizing F1-score, then the five CVMs
which achieve the highest F1-score are selected. As above, we choose the parameter
$C \in \{2^{-7}, \ldots, 2^{7}\}$ by 5-fold cross validation and fix B to be 1. 
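
The candidate pool described above (five kernels times ten kernel widths, fifty CVMs in total) and the top-five selection can be sketched as follows. The kernel labels, the function names and the toy scoring function are hypothetical; in the experiments the score would be the concerned performance measure of a trained CVM, which is not reproduced here.

```python
KERNELS = ["rbf", "polynomial", "laplacian",
           "inverse_distance", "inverse_squared_distance"]

# theta in {-0.5, 0, 0.5, ..., 4}: ten values in steps of 0.5.
THETAS = [-0.5 + 0.5 * i for i in range(10)]


def candidate_configs(gamma0):
    """All (kernel, gamma) pairs with gamma = 1.5^theta * gamma0."""
    return [(k, (1.5 ** t) * gamma0) for k in KERNELS for t in THETAS]


def select_top(candidates, measure, k=5):
    """Keep the k candidates with the highest value of the given measure."""
    return sorted(candidates, key=measure, reverse=True)[:k]


configs = candidate_configs(gamma0=1.0)


def toy_measure(cfg):
    # Stand-in for "performance of the CVM trained with this configuration".
    return -abs(cfg[1] - 2.0)


top5 = select_top(configs, toy_measure)
```

Note that the selection step requires training and evaluating all fifty candidates, which is the extra cost discussed in the time-efficiency comparison of this subsection.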

On each task, we compute the relative improvement of the averaged performance of the auxiliary 
classifiers and that of the obtained CAPO, and report them in Figure 2. The relative performance 
improvement is computed as the performance improvement caused by the auxiliary classifier 
selection divided by the performance before selection. From Figure 2, it is easy to see that 
although the averaged performance of the auxiliary classifiers improves a lot after selection, the 
performance of CAPO stays similar in most cases, and even degrades in some cases. This 
suggests that it is enough to use common CVMs as auxiliary classifiers, and that there is no need to 
specially design auxiliary classifiers according to the target performance measure. A possible
explanation is that the auxiliary classifiers are used to provide approximate solutions to the
problem, which are combined and further refined by the delta function to obtain the final solution;
thus these approximate solutions are not required to be very accurate. Moreover, with respect to
time efficiency, CAPO with auxiliary classifier selection has no advantage over the original one,
especially after counting the time used for training the fifty auxiliary CVMs. 
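
The relative improvement used in this comparison is a one-line computation, sketched below with hypothetical numbers; the function name `relative_improvement` is an assumption.

```python
def relative_improvement(before, after):
    """(performance after selection - performance before) / performance before."""
    if before == 0:
        raise ValueError("undefined when the performance before selection is zero")
    return (after - before) / before


# Hypothetical example: average auxiliary F1 rises from 0.80 to 0.84,
# while the adapted classifier moves from 0.90 to 0.89.
aux_gain = relative_improvement(0.80, 0.84)    # about +5%
capo_gain = relative_improvement(0.90, 0.89)   # about -1.1%
```

A negative value corresponds to the cases where performance degrades after auxiliary classifier selection.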
Figure 3: Performance and CPU time (in seconds) with different C's: (a) on USPS*; (b) on Reuters. Each subfigure 
shows performance in the 1st row and corresponding CPU time in the 2nd row. 



5.5. Parameter Sensitivity 

To study the impact of parameters, we perform experiments on two medium-sized data sets, 
USPS* and Reuters. The two data sets are representative, since nonlinear classifiers perform 
well on USPS* while linear classifiers work well on Reuters. We study the performance and 
time efficiency of CAPO_1 and CAPO_5 under different C and B values, where CAPO_1 uses one 
auxiliary CVM with RBF kernel and CAPO_5 uses five auxiliary CVMs with five different kernels 
as above; all kernels use default parameters. 

First, we vary C within $\{2^{-7}, 2^{-6}, \ldots, 2^{7}\}$ and fix B to be 1. For comparison, we
also train SVM^perf_lin and SVM^perf_rbf with the same C's. Figure 3 shows the results. It can be
found that CAPO_1 and CAPO_5 generally outperform SVM^perf at different C's, except that
SVM^perf_rbf achieves performance comparable to CAPO for PRBEP and AUC on USPS*, and SVM^perf_lin
performs better for AUC at large C's on Reuters. With respect to time efficiency, CAPO_1, CAPO_5 and 
SVM^perf_lin cost comparable CPU time, which is much less than that of SVM^perf_rbf. Moreover, CAPO_1
and CAPO_5 scale better when C increases, and they are more efficient than SVM^perf_lin at large C's. 
It is also easy to see that our methods, especially CAPO_5, are more robust to C. 

Figure 4: Performance and CPU time (in seconds) with different B's: (a) on USPS*; (b) on Reuters. Each 
subfigure shows performance in the 1st row and corresponding CPU time in the 2nd row. 



Second, we vary B within $\{2^{-7}, 2^{-6}, \ldots, 2^{7}\}$ with fixed C = 1 for CAPO_1 and CAPO_5.
For comparison, SVM^perf_lin and SVM^perf_rbf are trained with C = 1. The results are shown in
Figure 4, where SVM^perf_lin and SVM^perf_rbf are illustrated as straight lines because they do not
have the parameter B. In general, CAPO_1 and CAPO_5 achieve better performance at different B's in
most cases, except for AUC on Reuters. Also, CAPO_1 and CAPO_5 have efficiency comparable to that of
SVM^perf_lin, which is much better than that of SVM^perf_rbf. We can see that our methods are quite
robust to the parameter B, and comparatively speaking, CAPO_5 is more robust than CAPO_1. 

Figure 5: Performance (1st row) and CPU time (2nd row; in seconds) with different training set sizes on IJCNN1. 



Thus, we can see that our methods, especially CAPO_5, are robust to B and C. Comparatively 
speaking, CAPO_5 is more robust and more efficient than CAPO_1, which agrees with our previous results. 



5.6. Scalability w.r.t. Training Set Size 

To evaluate the scalability of CAPO, we perform experiments on the largest data set, IJCNN1. We 
first train CAPO_1 and CAPO_5 using {1/32, 1/16, 1/8, 1/4, 1/2, 1} of all training examples, and 
then evaluate them on test examples. For comparison, SVM^perf_lin and SVM^perf_rbf are also trained 
under the same configuration. In this experiment, we simply fix both parameters B and C 
to be 1. We report the performance of the compared methods and the corresponding CPU time. 
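
The training-set fractions of this experiment can be generated as in the sketch below. The text does not say how the subsets are drawn, so taking nested prefixes of one random permutation (each smaller subset contained in the next larger one) is an assumption, as are the function name and the seed.

```python
import random

FRACTIONS = [1 / 32, 1 / 16, 1 / 8, 1 / 4, 1 / 2, 1]


def nested_subsets(n_examples, fractions=FRACTIONS, seed=0):
    """Return index lists for each fraction, taken as prefixes of one shuffle."""
    order = list(range(n_examples))
    random.Random(seed).shuffle(order)
    return [order[: max(1, round(f * n_examples))] for f in fractions]


subsets = nested_subsets(64)
sizes = [len(s) for s in subsets]  # [2, 4, 8, 16, 32, 64]
```

Each method would then be trained on every index list with B = C = 1 and evaluated on the same fixed test set, recording performance and CPU time per size.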

Figure 5 shows the achieved performance and the corresponding running time in the
first and second rows, respectively. As we can see, all methods scale well except that SVM^perf_rbf
has to be terminated early when the training set size increases. Moreover, compared with
SVM^perf_lin, it is easy to see that CAPO_5 achieves better performance at lower time cost at every
training set size. 
5.7. Summary 

Based on the above empirical studies, we can see that CAPO is an effective and efficient approach 
to training classifiers that optimize performance measures. Compared with SVM^perf and SVM 
with cost model, it can achieve better performance at lower time cost. It has also been 
shown that CAPO is robust to its parameters and scales well w.r.t. the training set size. For 
practical implementation, training auxiliary classifiers by optimizing accuracy is a good choice, 
because many efficient algorithms have been developed in the literature, and the experiments in 
Section 5.4 suggest that using auxiliary classifiers with higher target performance does not show 
significant superiority, especially when tuning auxiliary classifiers costs much time. Meanwhile, 
it can be better to use multiple diverse auxiliary classifiers. 

6. Conclusion and Future Work 

This paper presents a new approach, CAPO, to training classifiers that optimize a specific
performance measure. Rather than designing sophisticated algorithms, we solve the problem in two 
steps: first, we train auxiliary classifiers with existing off-the-shelf learning algorithms; then 
these auxiliary classifiers are adapted to optimize the concerned performance measure. We show 
that the classifier adaptation problem can be formulated as an optimization problem similar to 
linear SVM^perf and can be efficiently solved. In practice, the auxiliary classifier (or ensemble of 
auxiliary classifiers) benefits CAPO in two aspects: 

1. By using nonlinear auxiliary classifiers, it injects the nonlinearity that is much needed in
practical applications; 

2. It provides an estimate of the target classifier, making the classifier adaptation procedure 
more efficient. 

Extensive empirical studies show that the classifier adaptation procedure helps to find the target 
classifier for the concerned performance measure. Moreover, the learning process becomes more 
efficient than linear SVM^perf, due to fewer inferences in CAPO. 

In this work, a linear delta function is used for classifier adaptation. Although it achieves good 
performance, an interesting and promising direction for future work is to exploit nonlinear delta
functions for this problem. 